Bronze Autoloader Generic

Document Version: 1.0
Last Updated: 20-04-2026


02_bronze_autoloader_generic

Purpose

02_bronze_autoloader_generic is the shared ingestion notebook that loads source files into a bronze Delta table using Databricks Auto Loader.

It is designed to be reusable across many feeds by changing widget parameters instead of cloning notebook logic.

What the active implementation does

The uploaded version of this notebook performs the following active steps:

  1. Reads widget parameters.
  2. Validates required core values.
  3. Resolves the schema JSON file path relative to the current notebook.
  4. Loads the schema from JSON into a Spark StructType.
  5. Configures a cloudFiles streaming reader.
  6. Applies source-format options such as CSV delimiter and header handling.
  7. Adds standard ingestion metadata columns.
  8. Writes the stream to the target Delta table using availableNow=True.
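
As an illustration of step 2, a fail-fast check over the widget values could look like the sketch below. The set of required parameters is an assumption made for this example; the notebook's actual list of core values may differ. In a notebook, the input dictionary would typically be built from dbutils.widgets.get(name) calls.

```python
from typing import Mapping

# Assumed set of required widgets; the notebook's real list may differ.
REQUIRED_PARAMS = (
    "source_format",
    "schema_file_path",
    "checkpoint_path",
    "target_table_name",
)


def validate_required(params: Mapping[str, str]) -> dict:
    """Strip whitespace and fail fast when a required value is missing or blank."""
    cleaned = {name: (value or "").strip() for name, value in params.items()}
    missing = [name for name in REQUIRED_PARAMS if not cleaned.get(name)]
    if missing:
        raise ValueError(f"Missing required parameters: {', '.join(missing)}")
    return cleaned
```

Raising before any stream is configured keeps a misconfigured job from creating an empty checkpoint or table.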

Read pattern

The notebook uses:

  • spark.readStream
  • format cloudFiles
  • cloudFiles.format = source_format
  • cloudFiles.schemaLocation = {checkpoint_path}/_schemas
  • cloudFiles.rescuedDataColumn = rescued_data_column

Together these settings select Databricks Auto Loader, which discovers new files incrementally in cloud storage or Unity Catalog volumes and uses the checkpoint to avoid re-ingesting files it has already processed.
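
The read settings listed above can be sketched as a small options builder plus a reader function. The option keys mirror the list in this section; the function names and the source_path argument are hypothetical, and build_reader only runs on a Databricks runtime where the cloudFiles source is available.

```python
def autoloader_options(source_format, checkpoint_path,
                       rescued_data_column="_rescued_data"):
    """Build the Auto Loader options named in this section."""
    return {
        "cloudFiles.format": source_format,
        # Schema tracking lives under the checkpoint directory.
        "cloudFiles.schemaLocation": f"{checkpoint_path.rstrip('/')}/_schemas",
        "cloudFiles.rescuedDataColumn": rescued_data_column,
    }


def build_reader(spark, schema, source_path, options):
    """Return a streaming DataFrame (needs a Databricks Spark session)."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .schema(schema)      # explicit StructType loaded from the schema JSON
        .load(source_path)   # source_path is a hypothetical parameter name
    )
```

Keeping the options in a plain dictionary makes them easy to log and to unit test without a cluster.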

Write pattern

The notebook writes with:

  • .format("delta")
  • .outputMode(output_mode)
  • .option("checkpointLocation", checkpoint_path)
  • .option("mergeSchema", str(merge_schema).lower())
  • .trigger(availableNow=True)
  • .toTable(target_table_name)

availableNow=True means each job run behaves like a bounded ingestion run that processes all currently available files and then stops.
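
A minimal sketch of the write pattern above, assuming df is the streaming DataFrame produced by the reader. The append default for output_mode and the awaitTermination() call are assumptions of this example, not confirmed behavior of the notebook.

```python
def writer_options(checkpoint_path, merge_schema):
    """Build the writer options; mergeSchema is passed as 'true' or 'false'."""
    return {
        "checkpointLocation": checkpoint_path,
        "mergeSchema": str(merge_schema).lower(),
    }


def write_bronze(df, target_table_name, checkpoint_path,
                 merge_schema=False, output_mode="append"):
    """Run one bounded ingestion pass (needs a Databricks Spark session)."""
    query = (
        df.writeStream.format("delta")
        .outputMode(output_mode)
        .options(**writer_options(checkpoint_path, merge_schema))
        .trigger(availableNow=True)   # process what is available, then stop
        .toTable(target_table_name)
    )
    # Assumption: block until the run finishes so the job task reflects
    # the outcome of the ingestion.
    query.awaitTermination()
    return query
```

Because str(merge_schema).lower() is applied to whatever the widget returns, both the boolean True and the string "True" end up as "true".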

Metadata columns added by the notebook

The notebook enriches ingested rows with standard bronze metadata:

  • w_business_ts
  • w_target_table_name
  • w_load_type
  • w_run_date
  • w_ingest_ts
  • w_source_file_name
  • w_ingestion_run_id
  • w_source_system
  • w_job_name
  • w_task_name
  • w_job_id
  • w_job_run_id
  • w_task_run_id
  • w_job_trigger_type
  • w_job_start_ts

These fields let downstream users trace each bronze row back to its source file, ingestion run, and job run, which simplifies support and reprocessing.
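
How each column is derived is not described here, so the following is only a sketch for a subset of the columns. It assumes that the static values come from widget parameters (the source_system key is a hypothetical widget name), that w_ingest_ts is the processing timestamp, and that w_source_file_name is taken from Spark's _metadata file column.

```python
def static_metadata(params, run_id):
    """Collect the literal metadata values for one run (subset only)."""
    return {
        "w_target_table_name": params["target_table_name"],
        "w_load_type": params["load_type"],
        "w_source_system": params["source_system"],  # hypothetical widget name
        "w_ingestion_run_id": run_id,
    }


def add_metadata(df, static_values):
    """Attach metadata columns to a DataFrame (needs PySpark 3.3+)."""
    from pyspark.sql import functions as F  # imported lazily; needs Spark

    columns = {name: F.lit(value) for name, value in static_values.items()}
    # Assumed derivations for the two dynamic columns:
    columns["w_ingest_ts"] = F.current_timestamp()
    columns["w_source_file_name"] = F.col("_metadata.file_name")
    return df.withColumns(columns)
```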

Schema handling

The notebook requires schema_file_path and reads the schema file as JSON.
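
A sketch of the schema load, assuming the file uses Spark's own schema JSON layout (the structure produced by df.schema.json(), with a top-level "type": "struct" and a "fields" list). The function names are hypothetical.

```python
import json


def load_schema_json(path):
    """Read the schema file and check that it looks like a Spark struct schema."""
    with open(path, encoding="utf-8") as handle:
        schema_dict = json.load(handle)
    if (not isinstance(schema_dict, dict)
            or schema_dict.get("type") != "struct"
            or "fields" not in schema_dict):
        raise ValueError(f"{path} is not a Spark struct schema JSON")
    return schema_dict


def to_struct_type(schema_dict):
    """Convert the parsed JSON into a StructType (needs PySpark)."""
    from pyspark.sql.types import StructType  # imported lazily; needs Spark

    return StructType.fromJson(schema_dict)
```

Checking the structure before calling StructType.fromJson gives a clearer error when the wrong file is deployed.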

Path resolution behavior

schema_file_path can be provided in one of these forms:

  • /Workspace/...
  • /some/workspace/relative/path
  • ./Schemas/schema_x.json

If the path is relative, the notebook resolves it relative to the notebook directory in the workspace.

This is useful when keeping notebook and schema assets together in the same folder structure.
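
The three path forms can be handled roughly as follows. This is a sketch, not the notebook's code: notebook_dir is assumed to be the workspace folder of the running notebook (how it is obtained is left out), and prefixing /Workspace to other absolute paths is an assumption about how workspace-rooted paths are treated.

```python
import posixpath


def resolve_schema_path(schema_file_path, notebook_dir):
    """Resolve schema_file_path to an absolute /Workspace path."""
    path = schema_file_path.strip()
    if path.startswith("/Workspace/"):
        # Already a full workspace file path.
        return posixpath.normpath(path)
    if path.startswith("/"):
        # Assumption: workspace-rooted path that lacks the /Workspace prefix.
        return posixpath.normpath("/Workspace" + path)
    # Relative path: resolve against the notebook's own folder.
    return posixpath.normpath(posixpath.join(notebook_dir, path))
```

For example, with notebook_dir set to /Workspace/proj/notebooks, the value ./Schemas/schema_x.json resolves to /Workspace/proj/notebooks/Schemas/schema_x.json.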

CSV-specific behavior

When source_format is csv, the notebook applies:

  • sep = delimiter
  • header = header
  • nullValue = null_value

For other source formats, those options are ignored unless you extend the notebook.
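
The CSV branch can be sketched as a function that returns reader options only for csv. The function name, the default values, and the case-insensitive comparison are assumptions of this example.

```python
def format_options(source_format, delimiter=",", header=True, null_value=""):
    """Return format-specific reader options; empty for non-CSV formats."""
    # Assumption: the format name is compared case-insensitively.
    if source_format.strip().lower() != "csv":
        return {}
    return {
        "sep": delimiter,
        "header": str(header).lower(),
        "nullValue": null_value,
    }
```

The returned dictionary can be merged into the Auto Loader options before the reader is built, so other formats simply contribute nothing.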

Checkpointing and schema tracking

Two storage locations matter:

checkpoint_path

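Used for Structured Streaming checkpoint state. This path must stay stable for the job. Do not change it after go-live without a plan: a new checkpoint path starts with no record of previously ingested files, so files already loaded may be ingested again.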

schema_location

Derived automatically as:

{checkpoint_path}/_schemas

This is where Auto Loader stores schema tracking information.

Current limitations and reserved parameters

The notebook includes parameters such as:

  • staging_table_name
  • business_keys
  • overwrite_schema
  • cleanup_stage_after_finalize

The uploaded implementation currently does not actively use those parameters in the live execution path.

There are commented sections that suggest an extended design involving:

  • staging table writes
  • row counting from a staged run
  • high-watermark processing
  • finalize logic for snapshot / incremental handling
  • cleanup of staged rows

Because those blocks are commented out, the generic notebook does not currently perform any of these steps. Treat them as a possible future design unless your environment runs a modified version.

load_type in the current implementation

load_type is currently written as metadata (w_load_type) and forwarded to the audit notebook. In the uploaded active path, it does not yet change the write behavior by itself.

That means values like snapshot and incremental are still useful for lineage and future compatibility, but they do not independently change ingestion semantics in this version unless you extend the notebook.

Operational guidance

Keep one checkpoint path per source/table

Do not share the same checkpoint path across unrelated feeds. Each logical ingestion should have its own checkpoint directory.
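
One way to enforce this rule is a pre-deployment check over the job configuration. The sketch below assumes a hypothetical mapping from feed name to checkpoint_path; it is not part of the notebook.

```python
from collections import defaultdict


def find_shared_checkpoints(feeds):
    """Return checkpoint paths that are used by more than one feed."""
    by_path = defaultdict(list)
    for feed_name, checkpoint_path in feeds.items():
        # Ignore a trailing slash so /chk/a and /chk/a/ count as the same path.
        by_path[checkpoint_path.rstrip("/")].append(feed_name)
    return {path: names for path, names in by_path.items() if len(names) > 1}
```

An empty result means every feed has its own checkpoint directory.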

Keep schema files under source control

The schema JSON is part of the ingestion contract. Store it with the notebook project and update it through normal change control.

Use precise file patterns

source_file_pattern helps prevent accidentally ingesting unrelated files from the same landing folder.
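
Before deploying a pattern, it can help to preview which file names it would select. The sketch below uses Python's fnmatch as an approximation; Spark's glob matching differs in some details (for example, fnmatch does not support {a,b} alternation), so treat the result as a sanity check only.

```python
from fnmatch import fnmatchcase


def preview_matches(file_names, source_file_pattern):
    """Return the file names that a glob-style pattern would select."""
    return [name for name in file_names
            if fnmatchcase(name, source_file_pattern)]


landing = ["orders_20260401.csv", "orders_20260401.csv.tmp", "customers_1.csv"]
selected = preview_matches(landing, "orders_*.csv")
```

Here only orders_20260401.csv is selected; the temporary file and the unrelated customers file are left out.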

Use a rescued data column

Keep _rescued_data enabled unless you have a strong reason not to. It captures data that does not fit the supplied schema, such as unexpected columns or values of the wrong type, so that it can be investigated instead of being lost.

Common issues

No files loaded

Possible causes:

  • landing path is empty
  • file pattern does not match incoming files
  • the checkpoint already records those files as processed, so they are not ingested again
  • source path is wrong

Schema file not found

Possible causes:

  • relative path resolved incorrectly
  • schema file not deployed with the notebooks
  • incorrect workspace path syntax

Unexpected schema errors

Possible causes:

  • schema JSON does not match actual file layout
  • delimiter or header configuration is wrong
  • merge_schema is set differently from what the feed expects (the notebook passes it to the writer as the mergeSchema option)